1. Introduction

Stochastic approximation is concerned with schemes converging to some sought value
when, due to the stochastic nature of the problem, the observations involve errors. The
interesting schemes are those which are self-correcting, that is, in which a mistake always
tends to be wiped out in the limit, and in which the convergence to the desired value is
of some specified nature, for example, mean-square convergence. The typical example
of such a scheme is the original one of Robbins and Monro [7] for approximating, under
suitable conditions, the point where a regression function assumes a given value. Robbins
and Monro proved mean-square convergence to the root; Wolfowitz [8] showed
that under weaker assumptions there is still convergence in probability to the root; and
Blum [1] demonstrated that, under still weaker assumptions, there is not only convergence
in probability but even convergence with probability 1. Kiefer and Wolfowitz [6]
devised a method for approximating the point where the maximum of a regression
function occurs. They proved that under suitable conditions there is convergence in
probability, and Blum [1] somewhat weakened the conditions and strengthened the
conclusion to convergence with probability 1.

The two schemes mentioned above are rather specific. We shall deal with a vastly
more general situation. The underlying idea is to think of the random element as noise
superimposed on a convergent deterministic scheme. The Robbins-Monro and
Kiefer-Wolfowitz procedures, under conditions weaker than any previously considered, are
included as very special cases and, despite this generality, the conclusion is stronger since
our results assert that the convergence is both in mean square and with probability 1.

The main results are stated in section 2 and their proof follows in sections 3 and 4.
Various generalizations are given in section 5, while section 6 furnishes an instructive
counterexample. The Robbins-Monro and Kiefer-Wolfowitz procedures are treated in
section 7. Because of the generality of our results the proofs in sections 3 and 4 have to
overcome a number of technical difficulties and are somewhat involved. A special case
of considerable scope where the technical difficulties disappear is discussed in section 8.
This section is essentially self-contained and includes an extremely simple complete proof
of the mean-square convergence result in the special case, which illustrates the underlying
idea of our method. In section 8 we also find the best (unique minimax in a
nonasymptotic sense) way of choosing the $a_n$ in a special case of the Robbins-Monro scheme
[they are of the form $c/(n + c')$]. The concluding section 9 contains some remarks on
extensions to nonreal random variables and other topics. Since the primary object of this
paper is to give the general approach, no attempt has been made to study any specific
procedures except the well-known Robbins-Monro and Kiefer-Wolfowitz schemes, which
serve as illustrations.

Research  sponsored  in  part  by  the  Office  of  Scientific  Research  of  the  Air  Force  under  contract  AF 
18  (600)442,  Project  R-345-20-7. 
2. Statement of the main results

Let $(\Omega = \{\omega\}, \mathfrak{A}, \mu)$ be a probability space. $X = X(\omega)$, $Y = Y(\omega)$ and $Z = Z(\omega)$,
as well as the same letters with primes or subscripts or both, will denote (real) random
variables, and the corresponding lower-case letters will denote values assumed by the
random variables. $T_n$, $T'_n$ and $T''_n$, $n = 1, 2, \ldots$, will denote measurable transformations
from $n$-dimensional real space into the reals. Instead of writing $T_n(r_1, \ldots, r_n)$ we shall
often write $T_n(r_n)$, exhibiting only the last argument. $E\{\ \}$ and $P\{\ \}$ will denote the
expected value of the random variable and the probability of the event within the braces,
respectively.

It is difficult to strike the proper balance between generality of result and simplicity
of statement. We shall first state only a moderately general version of our results and
follow it by an extension. Further generalizations will be given in section 5.

Theorem. Let $\alpha_n$, $\beta_n$ and $\gamma_n$, $n = 1, 2, \ldots$, be nonnegative real numbers satisfying

(2.1)  $\lim_{n \to \infty} \alpha_n = 0$,

(2.2)  $\sum_{n=1}^{\infty} \beta_n < \infty$

and

(2.3)  $\sum_{n=1}^{\infty} \gamma_n = \infty$.

Let $\theta$ be a real number and $T_n$, $n = 1, 2, \ldots$, be measurable transformations satisfying

(2.4)  $|T_n(r_1, \ldots, r_n) - \theta| \le \max[\alpha_n, (1 + \beta_n)|r_n - \theta| - \gamma_n]$

for all real $r_1, \ldots, r_n$. Let $X_1$ and $Y_n$, $n = 1, 2, \ldots$, be random variables and define

(2.5)  $X_{n+1}(\omega) = T_n[X_1(\omega), \ldots, X_n(\omega)] + Y_n(\omega)$

for $n \ge 1$.

Then the conditions $E\{X_1^2\} < \infty$,

(2.6)  $\sum_{n=1}^{\infty} E\{Y_n^2\} < \infty$

and

(2.7)  $E\{Y_n \mid x_1, \ldots, x_n\} = 0$

with probability 1 for all $n$, imply

(2.8)  $\lim_{n \to \infty} E\{(X_n - \theta)^2\} = 0$

and

(2.9)  $P\{\lim_{n \to \infty} X_n = \theta\} = 1$.

The main difficulty is in proving (2.8); once this is done (2.9) follows by a simple
device. In the theorem $\alpha_n$, $\beta_n$ and the restoring effect $\gamma_n$ are assumed independent of the
observations $x_1, \ldots, x_n$. This need not be so and the following statement dispenses with
this assumption.

Extension. The theorem remains valid if $\alpha_n$, $\beta_n$ and $\gamma_n$ in (2.4) are replaced by
nonnegative functions $\alpha_n(r_1, \ldots, r_n)$, $\beta_n(r_1, \ldots, r_n)$ and $\gamma_n(r_1, \ldots, r_n)$, respectively, provided they
satisfy the conditions: The functions $\alpha_n(r_1, \ldots, r_n)$ are uniformly bounded and

(2.10)  $\lim_{n \to \infty} \alpha_n(r_1, \ldots, r_n) = 0$

uniformly for all sequences $r_1, \ldots, r_n, \ldots$; the functions $\beta_n(r_1, \ldots, r_n)$ are measurable and

(2.11)  $\sum_{n=1}^{\infty} \beta_n(r_1, \ldots, r_n)$

is uniformly bounded and uniformly convergent for all sequences $r_1, \ldots, r_n, \ldots$; and the
functions $\gamma_n(r_1, \ldots, r_n)$ satisfy

(2.12)  $\sum_{n=1}^{\infty} \gamma_n(r_1, \ldots, r_n) = \infty$

uniformly for all sequences $r_1, \ldots, r_n, \ldots$ for which

(2.13)  $\sup_{n = 1, 2, \ldots} |r_n| < L$,

$L$ being an arbitrary finite number.

We shall refer to the theorem and its extension together as the extended theorem.
Condition (2.13) was introduced for the functions $\gamma_n$ because of its use in applications
(see section 7). Further generalizations will be given following the proof.
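
As an informal illustration of the theorem (not part of the original argument), the following Python sketch simulates the recursion (2.5) with $\theta = 0$ for one hypothetical choice of ingredients: $T_n(r) = \mathrm{sgn}(r)\max(|r| - 1/n, 0)$, which satisfies (2.4) with $\alpha_n = \beta_n = 0$ and $\gamma_n = 1/n$, and noise $Y_n = \xi_n/n$ with independent standard normal $\xi_n$, so that (2.6) and (2.7) hold. The function names, the seed and the step count are assumptions made for this example.

```python
import random


def shrink_toward_zero(r, gamma):
    """T_n(r) = sgn(r) * max(|r| - gamma, 0), so |T_n(r)| = max(0, |r| - gamma)."""
    if r > gamma:
        return r - gamma
    if r < -gamma:
        return r + gamma
    return 0.0


def simulate_scheme(x1, n_steps, seed=0):
    """Iterate X_{n+1} = T_n(X_n) + Y_n with gamma_n = 1/n and Y_n = xi_n / n."""
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        y_n = rng.gauss(0.0, 1.0) / n  # zero mean, variance 1/n^2 (summable)
        x = shrink_toward_zero(x, 1.0 / n) + y_n
    return x


if __name__ == "__main__":
    for seed in range(3):
        print(seed, simulate_scheme(2.0, 20000, seed=seed))
```

With these choices the printed values should lie close to 0, in line with (2.9). Replacing $\gamma_n = 1/n$ by a summable sequence such as $1/n^2$ violates (2.3), and the iterates then need no longer approach 0.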

3.  Proof  of  the  theorem 

Throughout the proofs, in this and the following section, we assume $\theta = 0$. This
involves no loss of generality. Let $m$ be any positive integer and $A$ any set in $\mathfrak{A}$ the
definition of which can be made in terms of $X_1, \ldots, X_m$. We shall first show that if $a$ is
any number satisfying $a \ge \alpha_m$ then

(3.1)  $\int_A [(|X_{m+1}| - a)^+]^2 \, d\mu \le \int_A \{a\beta_m(1 + a\beta_m) + Y_m^2 + (1 + \beta_m)^2(1 + a\beta_m)[(|X_m| - a)^+]^2\} \, d\mu$

where, as customary, $r^+ = (|r| + r)/2$ denotes the positive part of $r$.

If $Z = Z(\omega)$ is any random variable satisfying $|Z| \le a$ for all $\omega$ then clearly

(3.2)  $\int_A [(|X_{m+1}| - a)^+]^2 \, d\mu \le \int_A (X_{m+1} - Z)^2 \, d\mu = \int_A [T_m(X_m) - Z + Y_m]^2 \, d\mu$.

If $Z$ is defined in terms of $X_1, \ldots, X_m$ it then follows from (2.7) that

(3.3)  $\int_A [T_m(X_m) - Z + Y_m]^2 \, d\mu = \int_A [T_m(X_m) - Z]^2 \, d\mu + \int_A Y_m^2 \, d\mu$.

Taking $Z = T_m(X_m)$ if $|T_m(X_m)| \le a$, $Z = a$ if $T_m(X_m) > a$ and $Z = -a$ if $T_m(X_m) < -a$
we have

(3.4)  $\int_A [T_m(X_m) - Z]^2 \, d\mu = \int_A [(|T_m(X_m)| - a)^+]^2 \, d\mu$.
Since $a \ge \alpha_m$ we note that, by (2.4), we have $|T_m(r_m)| - a \le (1 + \beta_m)a - a = a\beta_m$
whenever $|r_m| \le a$, while otherwise $|T_m(r_m)| - a \le (1 + \beta_m)|r_m| - a = (1 + \beta_m)(|r_m| - a) + a\beta_m$.
Thus we have in all cases

(3.5)  $(|T_m(r_m)| - a)^+ \le (1 + \beta_m)(|r_m| - a)^+ + a\beta_m$.

Using the inequality $(u + v)^2 \le (1 + v)u^2 + v(1 + v)$, which is valid for any $v \ge 0$, we
obtain

(3.6)  $[(|T_m(r_m)| - a)^+]^2 \le (1 + a\beta_m)(1 + \beta_m)^2[(|r_m| - a)^+]^2 + a\beta_m(1 + a\beta_m)$.

Combining  (3.2),  (3.3),  (3.4)  and  (3.6)  we  obtain  (3.1). 
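
For completeness, the elementary inequality used to pass from (3.5) to (3.6) can be verified in one line. The derivation below is a routine check added for the reader; it uses only $2u \le u^2 + 1$ and $v \ge 0$.

```latex
(u + v)^2 = u^2 + 2uv + v^2
          \le u^2 + v\,(u^2 + 1) + v^2
          = (1 + v)\,u^2 + v\,(1 + v).
```

Taking $u = (1 + \beta_m)(|r_m| - a)^+$ and $v = a\beta_m$ in (3.5) gives (3.6).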

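Let now $n \ge m$ and assume that $a \ge \max_{m \le j \le n} \alpha_j$. Then iterating (3.1) gives
immediately

(3.7)  $\int_A [(|X_n| - a)^+]^2 \, d\mu \le u_{m,n} \int_A \{v_{m,n} + \sum_{j=m}^{n-1} Y_j^2 + [(|X_m| - a)^+]^2\} \, d\mu$

where

(3.8)  $u_{m,n} = \prod_{j=m}^{n-1} (1 + \beta_j)^2 (1 + a\beta_j)$,  $v_{m,n} = a \sum_{j=m}^{n-1} \beta_j (1 + a\beta_j)$

[when $n = m$ both sides of (3.7) are identical provided void sums and products are
interpreted as 0 and 1, respectively].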

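Let $\varepsilon > 0$ be given. Choose $a = a(\varepsilon)$ so that

(3.9)  $0 < a \le \sqrt{\varepsilon/8}$.

Then choose an integer $k = k(a, \varepsilon) > 1$ so that

(3.10)  $\max_{j \ge k-1} \alpha_j \le a$,

and

(3.11)  $u_k \left(v_k + \sum_{j=k-1}^{\infty} E\{Y_j^2\}\right) \le \frac{\varepsilon}{8}$

where

(3.12)  $u_k = \prod_{j=k}^{\infty} (1 + \beta_j)^2 (1 + a\beta_j)$,  $v_k = a \sum_{j=k}^{\infty} \beta_j (1 + a\beta_j)$;

such an integer $k$ exists in virtue of (2.1) and the fact that, by (2.2) and (2.6), all
infinite series and products involved are convergent.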

For every $j$ and $\omega$ put

(3.13)  $s_j = s_j(\omega) = \mathrm{sgn}\, T_j(X_j)$

where $\mathrm{sgn}\, r$ denotes, as usual, 1 when $r > 0$, $-1$ when $r < 0$ and 0 when $r = 0$. For $j > 1$
let $B'_j$ and $B''_j$ denote the events described by

(3.14)  $B'_j = \{\omega : \mathrm{sgn}\, X_j \ne s_{j-1}\}$

and

(3.15)  $B''_j = \{\omega : |T_{j-1}(X_{j-1})| \le a\}$.
Put $B_j = B'_j \cup B''_j$ and, for $m \ge k$,

(3.16)  $A_m = B_m - \bigcup_{j=k}^{m-1} B_j$.

Finally, let

(3.17)  $\Gamma_n = \bigcup_{m=k}^{n} A_m$,  $\Lambda_n = \Omega - \Gamma_n$

for every $n \ge k$.

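From (2.5) and (3.14) it follows that $|X_m| \le |Y_{m-1}|$ throughout $B'_m$, while from
(2.5) and (3.15) we have $|X_m| \le a + |Y_{m-1}|$ in $B''_m$. Thus $|X_m| - a \le |Y_{m-1}|$
throughout $B_m$ and, in particular, in $A_m$. Hence it results from (3.7), (3.8) and (3.12)
that

(3.18)  $\int_{A_m} [(|X_n| - a)^+]^2 \, d\mu \le u_k \int_{A_m} \left(v_k + \sum_{j=k-1}^{\infty} Y_j^2\right) d\mu$

whenever $n \ge m \ge k$. Since the sets $A_m$ are disjoint, it follows from (3.11) on summing
the inequalities (3.18) that

(3.19)  $\int_{\Gamma_n} [(|X_n| - a)^+]^2 \, d\mu \le \frac{\varepsilon}{8}$.

As $|X_n| \le (|X_n| - a)^+ + a$, it follows at once from (3.19) and (3.9) that

(3.20)  $\int_{\Gamma_n} X_n^2 \, d\mu \le 2\left(\frac{\varepsilon}{8} + \int_{\Gamma_n} a^2 \, d\mu\right) \le \frac{\varepsilon}{2}$

for every $n \ge k$.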

Now let us turn to $\Lambda_n$. By (3.14) we have outside $B'_j$

(3.21)  $|X_j| = X_j \,\mathrm{sgn}\, X_j = s_{j-1} T_{j-1}(X_{j-1}) + s_{j-1} Y_{j-1} = |T_{j-1}(X_{j-1})| + s_{j-1} Y_{j-1}$,

while outside $B''_j$ we have for $j \ge k$ by (3.15) and (3.10)

(3.22)  $|T_{j-1}(X_{j-1})| \le (1 + \beta_{j-1})|X_{j-1}| - \gamma_{j-1}$.

Hence outside $B_j$ we have

(3.23)  $|X_j| \le (1 + \beta_{j-1})|X_{j-1}| - \gamma_{j-1} + s_{j-1} Y_{j-1}$

whenever $j \ge k$. Since $\Lambda_n$ is contained in the complement of $B_j$ for every $k < j \le n$, we
can in $\Lambda_n$ iterate the inequalities (3.23) and obtain for $\omega$ in $\Lambda_n$

(3.24)  $|X_n| \le w_{k,n}|X_k| - \sum_{m=k}^{n-1} \gamma_m + \sum_{m=k}^{n-1} s_m w_{m+1,n} Y_m$

where

(3.25)  $w_{m,n} = \prod_{j=m}^{n-1} (1 + \beta_j)$.

Putting

(3.26)  $Z_n = w_{k,n}|X_k| - \sum_{m=k}^{n-1} \gamma_m$,  $Z'_n = \sum_{m=k}^{n-1} s_m w_{m+1,n} Y_m$


THIRD BERKELEY SYMPOSIUM: DVORETZKY


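we obtain from (3.24)

(3.27)  $\int_{\Lambda_n} X_n^2 \, d\mu \le \int_{\Lambda_n} [(Z_n + Z'_n)^+]^2 \, d\mu \le \int_{\Lambda_n} (Z_n^+ + |Z'_n|)^2 \, d\mu$.

Hence

(3.28)  $\int_{\Lambda_n} X_n^2 \, d\mu \le 2 \int_{\Lambda_n} (Z_n^+)^2 \, d\mu + 2 \int_{\Lambda_n} (Z'_n)^2 \, d\mu$.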

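But by (3.26) and (2.7), since $s_m$ is defined by $T_m(X_m)$, the cross terms in the
expansion of $(Z'_n)^2$ have zero expectation, so that

(3.29)  $\int_\Omega (Z'_n)^2 \, d\mu \le \sum_{m=k}^{n-1} w_{m+1,n}^2 E\{Y_m^2\}$

and hence

(3.30)  $\int_{\Lambda_n} (Z'_n)^2 \, d\mu \le \prod_{j=k}^{\infty} (1 + \beta_j)^2 \sum_{m=k}^{\infty} E\{Y_m^2\} \le \frac{\varepsilon}{8}$

by (3.11) and (3.12).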

Finally, we remark that if $Z$ is any random variable with $E\{Z^2\} < \infty$ then
$E\{[(Z - r)^+]^2\}$ tends to zero as $r \to +\infty$. By (3.7) with $m = 1$ and $n = k$ the random
variable $X_k$, and hence also $Z = |X_k| \prod_{j=k}^{\infty} (1 + \beta_j)$, have finite second moments. But,
by (3.26), $Z_n \le Z - \sum_{m=k}^{n-1} \gamma_m$ and it follows from (2.3) and the remark made at the
beginning of this paragraph that

(3.31)  $\int_{\Lambda_n} (Z_n^+)^2 \, d\mu \le E\{[(Z - \sum_{m=k}^{n-1} \gamma_m)^+]^2\} < \frac{\varepsilon}{8}$

for all $n > N = N(\varepsilon, k)$.

Combining (3.20), (3.27), (3.30) and (3.31) we have $E\{X_n^2\} < \varepsilon$ for $n > N$. Since
$\varepsilon > 0$ is arbitrary this completes the proof of (2.8).

The proof of (2.9) will now be easily achieved. Applying (3.7) with $A = \Omega$ we can
obtain for all $n > m$ an inequality of the form

(3.32)  $E\{X_n^2\} < H\left(\max_{j \ge m} \alpha_j,\ \sum_{j=m}^{\infty} \beta_j,\ E\{X_m^2\} + \sum_{j=m}^{\infty} E\{Y_j^2\}\right)$

where $H$ is an explicit function of the three exhibited variables, monotone increasing in
each and tending to zero as all three of them tend to zero. The important thing for us is
that $H$ does not depend on $X_j$, $T_j$, $Y_j$ except in the exhibited manner. In particular, the
$\gamma_n$ do not enter into (3.32), and it remains valid even if all of them are zero.

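Given $\delta > 0$ and $\varepsilon > 0$ there exists $\eta = \eta(\delta^2 \varepsilon)$ such that if all three arguments of $H$ are
smaller than $\eta$, then $H < \delta^2 \varepsilon$. By (2.1), (2.2), (2.6) and (2.8) there exists $m = m(\eta)$ satisfying

(3.33)  $\max_{j \ge m} \alpha_j < \eta$,  $\sum_{j=m}^{\infty} \beta_j < \eta$,  $E\{X_m^2\} + \sum_{j=m}^{\infty} E\{Y_j^2\} < \eta$.

Let this $m$ be fixed and define $X'_j$, $T'_j$, $Y'_j$ as follows: $X'_j = X_j$ for $j \le m$, $T'_j = T_j$ and
$Y'_j = Y_j$ for $j < m$, while for other values of $j$

(3.34)  $T'_j(r_1, \ldots, r_j) = T_j(r_1, \ldots, r_j)$ if $|r_j| < \delta$,  $T'_j(r_1, \ldots, r_j) = r_j$ if $|r_j| \ge \delta$,

and $Y'_j$, $X'_{j+1}$ are defined recursively by

(3.35)  $Y'_j = Y_j$ if $|X'_j| < \delta$,  $Y'_j = 0$ if $|X'_j| \ge \delta$

and

(3.36)  $X'_{j+1} = T'_j(X'_j) + Y'_j$.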

Clearly the $T'_n$ satisfy (2.4) with $\gamma_n = 0$ and we also have $E\{Y'_n \mid x'_1, \ldots, x'_n\} = 0$ and
$E\{(Y'_n)^2\} \le E\{Y_n^2\}$. Since $X'_m = X_m$ it follows from (3.32) and (3.33) that

(3.37)  $E\{(X'_n)^2\} < \delta^2 \varepsilon$

for $n > m$. According to the definition of $X'_n$ the relation $|X_j| \ge \delta$ for some $j \ge m$
implies $|X'_n| \ge \delta$ for all $n \ge j$. Hence we have for all $n \ge m$

(3.38)  $P\{\max_{m \le j \le n} |X_j| \ge \delta\} \le P\{|X'_n| \ge \delta\}$.

Combining (3.37) and (3.38) we have

(3.39)  $P\{\sup_{j \ge m} |X_j| > \delta\} < \varepsilon$;

$\delta$ and $\varepsilon$ being arbitrary, (2.9) follows and the proof is completed.


4.  Proof  of  the  extension 

We first remark that (3.1) holds provided $a \ge \sup \alpha_m(r_1, \ldots, r_m)$ for all $r_1, \ldots, r_m$
and $\beta_m$ is considered as a function of $X_1, \ldots, X_m$. Hence (3.7) also holds provided
$a \ge \sup \alpha_j(r_1, \ldots, r_j)$ for all $m \le j < n$ and all $r_1, \ldots, r_j$, while $u_{m,n}$ and $v_{m,n}$ are the
suprema of the expressions on the right side of the equalities (3.8) for all sequences
$r_1, \ldots, r_n, \ldots$. Also, given any $\varepsilon$ and $a$, there exists according to the assumptions of the
extension an integer $k$ satisfying $a \ge \sup \alpha_j(r_1, \ldots, r_j)$ for all $j \ge k - 1$ and all $r_1, \ldots, r_j$ and
(3.11) where $u_k$ and $v_k$ are defined as the suprema of the expressions on the right sides of
the equalities (3.12). Therefore (3.20) holds.

Always considering $\beta_j$ and $\gamma_j$ as functions of $X_1, \ldots, X_j$ and replacing the infinite
product in (3.30) by its sup for all sequences $r_1, \ldots, r_n, \ldots$ we see that everything up to
and including (3.30) carries through. Had we assumed (2.12) uniformly for all sequences
$r_1, \ldots, r_n, \ldots$, (3.31) would have also followed as before; since only a weaker
assumption was made, a slightly more sophisticated argument is needed for its proof.

We note that (3.32) remains valid provided the first two arguments are replaced by
their suprema. Let $M$ be a positive number and define $X''_j$, $T''_j$ and $Y''_j$ as follows:
$X''_1 = X_1$, $T''_j$ is given for all $j \ge 1$ by (3.34) with $\delta$ replaced by $M$, and $Y''_j$, $X''_{j+1}$ for
$j \ge 1$ are given recursively by (3.35) with $X'_j$ and $\delta$ replaced by $X''_j$ and $M$, and (3.36) with
primes replaced by double primes. Then, exactly as in (3.37), we have

(4.1)  $E\{(X''_n)^2\} < K$

where

(4.2)  $K = H\left(\sup \alpha_j,\ \sup \sum_{j=1}^{\infty} \beta_j,\ E\{X_1^2\} + \sum_{j=1}^{\infty} E\{Y_j^2\}\right)$,

the suprema being taken over all $j$ and all sequences $r_1, \ldots, r_n, \ldots$. Hence, as in (3.38)
and (3.39),

(4.3)  $P\{\sup_n |X_n| > M\} \le \sup_n P\{|X''_n| \ge M\} \le K/M^2$.

Thus the sequence $X_n$ is bounded with probability 1.

Let us now return to $Z_n$ as defined in (3.26). Putting

(4.4)  $w_k = \sup \prod_{j=k}^{\infty} [1 + \beta_j(r_1, \ldots, r_j)]$

for all sequences $r_1, \ldots, r_j, \ldots$ we have

(4.5)  $Z_n \le w_k |X_k| - \sum_{m=k}^{n-1} \gamma_m(X_1, \ldots, X_m)$

($w_k$, unlike $w_{k,n}$ and $w_{m,n}$, is a positive constant). Since $X_k$ has finite second moment
there exists $\zeta = \zeta(\varepsilon, k) > 0$ such that

(4.6)  $\int_{\Omega_1} X_k^2 \, d\mu < \frac{\varepsilon}{16 w_k^2}$

whenever $\Omega_1$ is a measurable set satisfying $P\{\Omega_1\} < \zeta$. Let now

(4.7)  M  = 

and denote by $\Omega_1$ the set of $\omega$ for which $\sup_n |X_n| > M$ and by $\Omega_2$ the complementary
set. Then by (4.3), (4.5), (4.6) and (4.7)

(4.8)  $\int_{\Omega_1} (Z_n^+)^2 \, d\mu \le w_k^2 \int_{\Omega_1} X_k^2 \, d\mu < \frac{\varepsilon}{16}$.

On $\Omega_2$, however, $\sum \gamma_m(X_1, \ldots, X_m)$ diverges uniformly by assumption and hence, by the
argument leading to (3.31), we have

(4.9)  $\int_{\Omega_2} (Z_n^+)^2 \, d\mu < \frac{\varepsilon}{16}$

for $n > N' = N'(\varepsilon, \zeta, k)$. The inequalities (4.8) and (4.9) imply (3.31) and hence (2.8).
The proof of (2.9) then follows word for word as in the preceding section.


5.  Generalizations 

The requirements (2.4) are satisfied by most nonstochastic approximation schemes.
Thus (2.4) (with $\theta = 0$) is weaker than

(5.1)  $|T_n(r_1, \ldots, r_n)| \le \max[\alpha_n, (1 + \beta_n - \gamma_n)|r_n|]$

with $\alpha_n$, $\beta_n$, $\gamma_n$ satisfying (2.1), (2.2) and (2.3). Indeed, if (2.3) is satisfied then there
exists a sequence $\rho_n$, $n = 1, 2, \ldots$, of positive numbers tending to zero and having the
property that $\sum \gamma_n \rho_n$ is divergent. The second term under the maximum in (5.1) is
always less than or equal to $\max[(1 + \beta_n)\rho_n, (1 + \beta_n)|r_n| - \gamma_n \rho_n]$. Thus (5.1) implies
(2.4) with $\alpha_n$ replaced by $\max[\alpha_n, (1 + \beta_n)\rho_n]$ and $\gamma_n$ replaced by $\gamma_n \rho_n$. Since these
replacements do not affect conditions (2.1) and (2.3) it follows that (5.1) is subsumed
under (2.4). Of course if we are interested in the rate of convergence it may be better to use
(5.1) directly than to reduce it to (2.4). Some remarks in this direction will be found in
section 8.
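
The case distinction behind this reduction ($|r_n| \le \rho_n$ versus $|r_n| > \rho_n$) is easy to check numerically. The following Python sketch, whose function names and sampling ranges are assumptions chosen for illustration, tests on randomly drawn values that the second term under the maximum in (5.1) never exceeds the larger of $(1 + \beta_n)\rho_n$ and $(1 + \beta_n)|r_n| - \gamma_n \rho_n$.

```python
import random


def second_term_of_5_1(beta, gamma, r):
    """Second term under the maximum in (5.1): (1 + beta - gamma) * |r|."""
    return (1.0 + beta - gamma) * abs(r)


def reduced_bound(beta, gamma, rho, r):
    """max[(1 + beta) * rho, (1 + beta) * |r| - gamma * rho]."""
    return max((1.0 + beta) * rho, (1.0 + beta) * abs(r) - gamma * rho)


def reduction_holds(trials=100000, seed=0, tol=1e-9):
    """Return True if the bound holds for every randomly sampled parameter set."""
    rng = random.Random(seed)
    for _ in range(trials):
        beta = rng.uniform(0.0, 2.0)
        gamma = rng.uniform(0.0, 2.0)
        rho = rng.uniform(0.0, 1.0)
        r = rng.uniform(-10.0, 10.0)
        lhs = second_term_of_5_1(beta, gamma, r)
        if lhs > reduced_bound(beta, gamma, rho, r) + tol:
            return False
    return True


if __name__ == "__main__":
    print(reduction_holds())
```

A random search of this kind is only a sanity check, not a proof; the inequality itself follows from the two cases noted above.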

Similarly

(5.2)  $|T_n(r_1, \ldots, r_n)| \le \max[\alpha_n, (1 + \beta_n)|r_n| - \gamma_n + \delta_n]$

with $\alpha_n$, $\beta_n$, $\gamma_n$ as before and $\delta_n \ge 0$ satisfying

(5.3)  $\sum_{n=1}^{\infty} \delta_n < \infty$

is only deceptively more general than (2.4). To see this we remark that in view of (5.3)
there exists a sequence of positive numbers $\lambda_n$, $n = 1, 2, \ldots$, tending to infinity slowly
enough so that $\sum \lambda_n \delta_n$ is convergent. The second term under the max in (5.2) being
always less than or equal to $\max[(1 + \beta_n)/\lambda_n + \delta_n, (1 + \beta_n + \lambda_n \delta_n)|r_n| - \gamma_n]$ it
follows that (5.2) implies (2.4) with $\alpha_n$ replaced by $\alpha'_n = \max[\alpha_n, (1 + \beta_n)/\lambda_n + \delta_n]$ and $\beta_n$
replaced by $\beta'_n = \beta_n + \lambda_n \delta_n$. Since these replacements do not affect conditions (2.1) and
(2.2) our assertion is proved. Similar remarks to those made above concerning (2.4)
apply also to the extension with $\delta_n(r_1, \ldots, r_n)$ satisfying the same requirements as
$\beta_n(r_1, \ldots, r_n)$.

The possibility of deducing our results under the assumption (5.2) has, however, an
important consequence allowing the weakening of condition (2.7). This weakening may
be useful in some applications, especially when dealing with certain rounding-off errors.

Generalization 1. The Extended Theorem remains valid if (2.7) is replaced by

(5.4)  $\sum_{n=1}^{\infty} \sup_{x_1, \ldots, x_n} |E\{Y_n \mid x_1, \ldots, x_n\}| < \infty$,

or even by the condition that

(5.5)  $\sum_{n=1}^{\infty} |E\{Y_n \mid x_1, \ldots, x_n\}|$

be uniformly bounded and uniformly convergent for all sequences $x_1, \ldots, x_n, \ldots$.

Indeed, putting $Y'_n = Y_n - E\{Y_n \mid x_1, \ldots, x_n\}$ and $T'_n(x_1, \ldots, x_n) = T_n(x_1, \ldots, x_n) +
E\{Y_n \mid x_1, \ldots, x_n\}$ we have $X_{n+1} = T'_n(X_n) + Y'_n$. If (5.4) holds, then the
transformations $T'_n$ satisfy (5.2) with $\delta_n = \sup |E\{Y_n \mid x_1, \ldots, x_n\}|$ which, by (5.4), satisfy (5.3),
while the $Y'_n$ satisfy the conditions imposed on $Y_n$ in (2.6) and (2.7) since
$E\{(Y'_n)^2\} \le 2E\{Y_n^2\} + 2\delta_n^2$. A similar argument applies when (5.5) holds.

Another sometimes useful extension is the following.

Generalization 2. Conclusion (2.9) of the Extended Theorem remains valid even
without any restrictions on $X_1$ and if (2.5) is replaced by

(5.6)  $X_{n+1} = T_n(X_n) + Y_n^*$

with the random variables $Y_n^*$, $n = 1, 2, \ldots$, satisfying

(5.7)  $P\{Y_n^* \ne Y_n \text{ for infinitely many } n\} = 0$,

thus, in particular, when

(5.8)  $\sum_{n=1}^{\infty} P\{Y_n^* \ne Y_n\} < \infty$.

Indeed, if $\Omega'$ denotes the set where $|X_m| < M$ and $\Omega''$ the set where $Y_n^* = Y_n$ for
all $n \ge m$ then it follows from the Extended Theorem that $P\{X_n \to \theta \mid \Omega' \cap \Omega''\} = 1$.
Since $P\{\Omega' \cap \Omega''\}$ can be made arbitrarily close to 1 the result follows. This simple
generalization may often be used in order to reduce the study to the case when the
random variables are bounded.

Our proof was arranged in such a manner that it yields also the following.

Generalization 3. If (2.1) is replaced by

(5.9)  $\lim_{n \to \infty} \alpha_n = \alpha$,

or, more generally, (2.10) by

(5.10)  $\limsup_{n \to \infty} \alpha_n(r_1, \ldots, r_n) \le \alpha$

uniformly for all sequences $r_1, \ldots, r_n, \ldots$ then the Extended Theorem remains valid
provided (2.8) and (2.9) are replaced by

(5.11)  $\limsup_{n \to \infty} E\{(X_n - \theta)^2\} \le \alpha^2$

and

(5.12)  $P\{\limsup_{n \to \infty} |X_n - \theta| \le \alpha\} = 1$.

Another type of generalization which is useful can be exemplified by the following.

Generalization 4. The Extended Theorem remains valid if the assumptions
concerning $\alpha_n(r_1, \ldots, r_n)$ are replaced by the following: $\alpha_1(X_1)$ is bounded with probability 1,
$\alpha_n(X_1, \ldots, X_n) \ge \alpha_{n+1}(X_1, \ldots, X_n, X_{n+1})$ with probability 1 and

(5.13)  $P\{\lim_{n \to \infty} \alpha_n(X_1, \ldots, X_n) = 0\} = 1$.

Indeed, denoting by $\alpha$ an upper bound in probability of $\alpha_1(X_1)$, by $\Omega_m$ the set where
$\alpha_m(X_1, \ldots, X_m) < \varepsilon$ and by $\Omega'_m$ its complement we have

(5.14)  $E\{X_n^2\} = P\{\Omega_m\} E\{X_n^2 \mid \Omega_m\} + P\{\Omega'_m\} E\{X_n^2 \mid \Omega'_m\}$

and (5.11), with $\theta = 0$ for brevity, gives

(5.15)  $\limsup_{n \to \infty} E\{X_n^2\} \le \varepsilon^2 + \alpha^2 P\{\Omega'_m\}$.

Since $P\{\Omega'_m\} \to 0$ as $m \to \infty$ by (5.13), we have (2.8); the proof of (2.9) is exactly the
same.

The last generalization we wish to present extends the class of transformations $T_n$.
Instead of considering transformations $T_n$ determined by $x_1, x_2, \ldots, x_n$ we may consider
random ones depending on the sample point $\omega$, that is, measurable mappings of $R \times \Omega$
into $R$, $R$ being the real line. In this case $x_1, \ldots, x_n$ do not determine the value $t_n$
assumed by $T_n(X_n)$. However, except for this fact, which necessitates a restatement of (2.7),
nothing is changed in all our arguments. Hence we have

Generalization 5. The Extended Theorem remains valid also if $T_n$, $n = 1, 2, \ldots$, are
random transformations provided (2.4) holds for all $\omega$ and (2.7) is replaced by

(5.16)  $E\{Y_n \mid x_1, \ldots, x_n, t_n\} = 0$

with probability 1.

All the above generalizations may be used in conjunction. Many similar ones can
easily be given.

6.  A  counterexample 

The very generality of our results might lead us to suspect that even weak restrictions
of the type of (2.4) or its generalizations on the $T_n$ are entirely superfluous. In other words,
one might be tempted to conjecture that whenever we have a sequence of
transformations $T_n(r_n) = T_n(r_1, \ldots, r_n)$ of $n$-space into the reals having the property that for every
$m$ and $r_1, \ldots, r_m$ the sequence

(6.1)  $r_{m+1} = T_m(r_m), \ \ldots, \ r_{m+n+1} = T_{m+n}(r_{m+n}), \ \ldots$

converges to 0 then $E\{X_1^2\} < \infty$, (2.6) and (2.7) already imply (2.8) or (2.9). The
following simple example shows that this is not the case.

Let $q_n$ and $v_n$, $n = 1, 2, \ldots$, be two sequences of positive numbers with $q_n < 1$ and
such that both series $\sum q_n$ and $\sum v_n^2/q_n$ are convergent; for instance, $q_n = v_n = 1/(n^2 + 1)$.
Put $s_n = v_1 + \cdots + v_n$ and let $T_n$ depend only on its last argument and be defined by
$T_n(r_n) = s_{n-1}$ for $r_n = s_{n-1}$ and $T_n(r_n) = 0$ otherwise. No matter what $r_1, \ldots, r_m$ are,
all members of (6.1) from the second on are zero. Let now $X_1, Y_1, \ldots, Y_n, \ldots$ be mutually
independent with $X_1 = 0$ and $Y_n$ assuming the two values $v_n$ and $-(1 - q_n)v_n/q_n$ with
probabilities $1 - q_n$ and $q_n$, respectively. Clearly $E\{Y_n\} = 0$ and $E\{Y_n^2\} < v_n^2/q_n$, and
thus (2.6) and (2.7) are satisfied. On the other hand, the probability that $X_{n+1} = s_n$ for
every $n$ equals the probability that $Y_n = v_n$ for every $n$ and, being equal to $\prod_{n=1}^{\infty} (1 - q_n)$,
is positive. Hence not only (2.8) and (2.9) fail to hold but $X_n$ does not even converge in
probability to zero. [In this example the $T_n$ are discontinuous; this is easily remedied. All
we have to do is to define $T_n(r_n) = s_{n-1}$ for $r_n = s_{n-1}$, $T_n(r_n) = 0$ for $r_n \le s_{n-2}$ or
$r_n \ge s_n$ and by linear interpolation for the remaining values of $r_n$.]
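
A short numerical check of the instance $q_n = v_n = 1/(n^2 + 1)$ may help. The Python sketch below (function names are illustrative assumptions) follows the recursion along the event that $Y_n = v_n$ for every $n$, confirming that $X_{n+1} = s_n$ there, and evaluates partial products of $\prod (1 - q_n)$. For this particular choice the infinite product equals $\pi/\sinh \pi \approx 0.272$; this closed form is a standard identity and is not stated in the text.

```python
import math


def v(n):
    """v_n = q_n = 1 / (n^2 + 1), the instance suggested in the text."""
    return 1.0 / (n * n + 1)


def follow_favorable_path(n_steps):
    """Iterate X_{n+1} = T_n(X_n) + Y_n with Y_n = v_n at every step.

    T_n(r) equals s_{n-1} when r == s_{n-1} and 0 otherwise.
    Returns the pair (X_{n_steps + 1}, s_{n_steps}).
    """
    x = 0.0       # X_1 = 0
    s_prev = 0.0  # s_0 = 0 (empty sum)
    for n in range(1, n_steps + 1):
        t = s_prev if x == s_prev else 0.0  # T_n(X_n)
        x = t + v(n)                        # X_{n+1}
        s_prev = s_prev + v(n)              # s_n
    return x, s_prev


def probability_of_path(n_terms):
    """Partial product of (1 - q_n) for n = 1, ..., n_terms."""
    p = 1.0
    for n in range(1, n_terms + 1):
        p *= 1.0 - v(n)
    return p


if __name__ == "__main__":
    print(follow_favorable_path(1000))
    print(probability_of_path(10000), math.pi / math.sinh(math.pi))
```

On this event, which has probability about 0.27, the iterates increase toward $\sum v_n \approx 1.08$ instead of approaching 0.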

7.  The  Robbins-Monro  and  Kiefer-Wolfowitz  procedures 

In  this  section  we  deal  with  a  very  special  case  of  the  general  theory.  It  will  be  shown 
that  specializing  the  general  results  will,  without  further  ado,  improve  the  best  results 
previously  obtained  for  the  specific  procedures. 

Let $Z_u$ be a one-parameter family of random variables, the parameter space being the
real line, and assume that

(7.1)  $f(u) = E\{Z_u\}$

exists for every $u$. The Robbins-Monro and Kiefer-Wolfowitz procedures are concerned
with finding, under suitable assumptions, the location of the root $f(u) = 0$ and of the
maximum of the regression function $f(u)$.
The Robbins-Monro procedure is based on a sequence of positive numbers $a_n$, $n = 1,
2, \ldots$, satisfying

(7.2)  $\sum_{n=1}^{\infty} a_n = \infty$,  $\sum_{n=1}^{\infty} a_n^2 < \infty$.

Then, starting with an arbitrary $x_1$, it defines recursively a sequence $x_n$, $n = 1, 2, \ldots$, by

(7.3)  $x_{n+1} = x_n - a_n z_n$

where $z_n$ is an observation on the random variable $Z_{x_n}$.
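
As an informal illustration of the recursion $x_{n+1} = x_n - a_n z_n$, here is a minimal Python sketch of the Robbins-Monro procedure. The regression function $f(u) = u - 2$ (root $\theta = 2$), the standard normal observation noise and the choice $a_n = 1/n$ are assumptions made for this example only; $a_n = 1/n$ satisfies (7.2).

```python
import random


def robbins_monro(observe, x1, n_steps, seed=0):
    """Iterate x_{n+1} = x_n - a_n * z_n with the assumed steps a_n = 1/n.

    observe(u, rng) must return one noisy observation z on Z_u.
    """
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        z_n = observe(x, rng)
        x = x - z_n / n
    return x


def noisy_linear(u, rng):
    """Hypothetical observation: f(u) = u - 2 plus standard normal noise."""
    return (u - 2.0) + rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    print(robbins_monro(noisy_linear, x1=10.0, n_steps=20000))
```

For this linear example the error $x_{n+1} - 2$ is exactly minus the average of the first $n$ noise terms, so it shrinks like $n^{-1/2}$ whatever the starting point.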

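The Kiefer-Wolfowitz procedure is based on two sequences of positive numbers $b_n$ and
$c_n$ satisfying

(7.4)  $\lim_{n \to \infty} c_n = 0$,  $\sum_{n=1}^{\infty} b_n = \infty$,  $\sum_{n=1}^{\infty} b_n^2 / c_n^2 < \infty$.

Then, starting with an arbitrary $x_1$, it defines recursively a sequence $x_n$, $n = 1, 2, \ldots$, by

(7.5)  $x_{n+1} = x_n + \frac{b_n}{c_n} (z'_n - z''_n)$

where $z'_n$ and $z''_n$ are observations on $Z_{x_n + c_n}$ and $Z_{x_n - c_n}$, respectively.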
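
The Kiefer-Wolfowitz recursion $x_{n+1} = x_n + (b_n/c_n)(z'_n - z''_n)$, in the form implied by (7.14) and (7.15), can be sketched in the same informal way. The regression function $f(u) = -(u - 1)^2$ (maximum at $\theta = 1$), the standard normal noise and the step sizes $b_n = 1/(4n)$, $c_n = n^{-1/4}$ are assumptions made for this example; with them $c_n \to 0$, $\sum b_n = \infty$ and $\sum b_n^2/c_n^2 < \infty$.

```python
import random


def kiefer_wolfowitz(observe, x1, n_steps, seed=0):
    """Iterate x_{n+1} = x_n + (b_n / c_n) * (z'_n - z''_n).

    Assumed step sizes: b_n = 1 / (4 n), c_n = n ** (-1/4).
    observe(u, rng) must return one noisy observation on Z_u.
    """
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        b_n = 1.0 / (4.0 * n)
        c_n = n ** -0.25
        z_plus = observe(x + c_n, rng)   # observation on Z_{x_n + c_n}
        z_minus = observe(x - c_n, rng)  # observation on Z_{x_n - c_n}
        x = x + (b_n / c_n) * (z_plus - z_minus)
    return x


def noisy_parabola(u, rng):
    """Hypothetical observation: f(u) = -(u - 1)^2 plus standard normal noise."""
    return -(u - 1.0) ** 2 + rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    print(kiefer_wolfowitz(noisy_parabola, x1=5.0, n_steps=20000))
```

For this quadratic example the difference $f(x + c_n) - f(x - c_n)$ equals $-4c_n(x - 1)$, so the deterministic part of each step multiplies the error by $1 - 1/n$ and the printed value should be close to 1.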

Result 1. If the $Z_u$ have uniformly bounded variances and if the regression function $f(u)$
is measurable and satisfies

(7.6)  $|f(u)| < A|u| + B < \infty$

for all $u$ and suitable $A$ and $B$, and

(7.7)  $\inf_{1/k < u - \theta < k} f(u) > 0$,  $\sup_{1/k < \theta - u < k} f(u) < 0$

for all integers $k$; then the Robbins-Monro sequence (7.3) converges to $\theta$ both in mean square
and with probability 1.

Indeed, an underlying probability space can be constructed in which $x_n$ is an
observation on the random variable $X_n$; then $X_1 = x_1$ and

(7.8)  $X_{n+1} = X_n - a_n f(X_n) + Y_n$

with

(7.9)  $Y_n = -a_n [Z_{X_n} - f(X_n)]$.

From (7.1) we have (2.7) while the assumption

(7.10)  $E\{[Z_u - f(u)]^2\} \le \sigma^2 < \infty$

for all $u$ gives $E\{Y_n^2\} \le \sigma^2 a_n^2$ and thus, by (7.2), (2.6) holds. Assume, for simplicity of
writing, $\theta = 0$ and let $\rho_n$, $n = 1, 2, \ldots$, be a sequence of positive numbers tending to
zero and for which $\sum \rho_n a_n = \infty$; and let $\eta_n$ be also a null sequence of positive numbers
having the property that $\inf_{\eta_n \le |u| \le 1/\eta_n} |f(u)| > \rho_n$. By (7.6) we have $|u| - a_n |f(u)| > -B a_n$
for all $n > n_0$, while given any $L > 0$ we have $|u| - a_n |f(u)| < |u| - a_n \rho_n$ for all
$\eta_n < |u| < L$ and $n > n_L$. Thus the transformations $T_n(r_1, \ldots, r_n) = r_n - a_n f(r_n)$
occurring in (7.8) satisfy, for large $n$, condition (2.4) with $\alpha_n = \max(\eta_n, B a_n)$ and $\gamma_n = a_n \rho_n$.
Since $\sum a_n \rho_n$ is divergent, the result follows from the Extended Theorem. (The argument
could be somewhat simplified by using generalization 3; the introduction of the
sequences $\rho_n$ and $\eta_n$ could then have been avoided.)

Remark 1. Condition (7.6) is necessary in order to dampen the restoring effect of
$-a_n z_n$. This is illustrated by the following simple example: $a_n = 1/n$, $Z_u = f(u) = u|u|$
with probability 1. Taking $x_1 = 3$ we have $x_2 = 3 - 3^2 = -6$, $x_3 = -6 + 6^2/2 = 12$,
$\ldots$ and it is easily verified that $|x_n| \to \infty$. Condition (7.6) can be shown to be the only
one of its type that will eliminate this phenomenon for all sequences $a_n$; for any specific
sequence this condition can, of course, be somewhat relaxed. However, it should be
emphasized that in practice, condition (7.6) causes no trouble. Indeed, in all practical
situations one knows in advance that the root $\theta$ lies in some finite interval $(C_1, C_2)$. Then,
provided $f(u)$ is bounded in $(C_1, C_2)$, one can replace $Z_u$ by, say, $+1$ for $u > C_2$
and by $-1$ for $u < C_1$. [Such a replacement also substitutes a stronger version of (7.7);
it is no longer necessary to consider the possibility that $|f(u)|$ may become very small
as $|u| \to \infty$ and result 1 would in this case follow directly from the theorem.]
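
The divergence in this example is easy to reproduce. The Python sketch below (the function name is an illustrative assumption) iterates $x_{n+1} = x_n - a_n f(x_n)$ with $a_n = 1/n$ and $f(u) = u|u|$ in exact rational arithmetic.

```python
from fractions import Fraction


def noiseless_iterates(x1, n_terms):
    """Return [x_1, ..., x_{n_terms}] for x_{n+1} = x_n - (1/n) * x_n * |x_n|."""
    xs = [Fraction(x1)]
    for n in range(1, n_terms):
        x = xs[-1]
        xs.append(x - Fraction(1, n) * x * abs(x))
    return xs


if __name__ == "__main__":
    print([str(x) for x in noiseless_iterates(3, 6)])
```

The first iterates are 3, -6, 12, -36, 288: the sign alternates while the absolute value grows rapidly.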

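We now proceed to deal with the Kiefer-Wolfowitz scheme. We denote by $\overline{D}f(u)$ and
$\underline{D}f(u)$ the upper and lower derivatives of $f(u)$:

(7.11)  $\overline{D}f(u) = \limsup_{0 \ne h \to 0} \frac{f(u + h) - f(u)}{h}$,  $\underline{D}f(u) = \liminf_{0 \ne h \to 0} \frac{f(u + h) - f(u)}{h}$.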

Result 2. If the $Z_u$ have uniformly bounded variances and if the regression function
$f(u)$ satisfies

(7.12)  $|f(u + 1) - f(u)| < A|u| + B < \infty$

for all $u$ and suitable $A$ and $B$, and

(7.13)  $\sup_{1/k < u - \theta < k} \overline{D}f(u) < 0$,  $\inf_{1/k < \theta - u < k} \underline{D}f(u) > 0$

for all integers $k$, then the Kiefer-Wolfowitz sequence (7.5) converges to $\theta$ both in mean square
and with probability 1.

Indeed, putting $X_1 = x_1$ and

$$X_{n+1} = X_n + \frac{b_n}{c_n}\,[f(X_n + c_n) - f(X_n - c_n)] + Y_n \tag{7.14}$$

with

$$Y_n = \frac{b_n}{c_n}\,[Z_{X_n + c_n} - f(X_n + c_n) - Z_{X_n - c_n} + f(X_n - c_n)], \tag{7.15}$$

we see, by (7.1), that $Y_n$ satisfies (2.7). Also, since (7.10) holds, we have
$E\{Y_n^2\} \le 2\sigma^2 b_n^2/c_n^2$ and hence, by (7.4), condition (2.8) is also
satisfied. Thus, again assuming $\theta = 0$, all that remains to be shown is that the
transformations $T_n(r_1, \ldots, r_n) = r_n + b_n[f(r_n + c_n) - f(r_n - c_n)]/c_n$
satisfy (2.4). Since $c_n$ tends to zero we have from (7.12) the inequality
$|f(u + c_n) - f(u - c_n)| < A|u| + A + B$ for $n > n_0$. Noticing that $u$ and
$f(u + c_n) - f(u - c_n)$ have different signs for $|u| > c_n$ and remembering that
$b_n/c_n \to 0$ by (7.4), we see that given $\rho > 0$ we have
$|T_n(r_n)| \le \max\,[\rho + b_n(A\rho + A + B)/c_n,\ |r_n|]$ for all $n > n_\rho$. Also,
given any $L > \rho$ we have $|f(u + c_n) - f(u - c_n)| > 2\gamma c_n$, where

$$\gamma = \min\Big[-\sup_{\rho/2 < u < L+1} \overline{D}f(u),\ \inf_{\rho/2 < -u < L+1} \underline{D}f(u)\Big],$$

for $\rho < |u| < L$ and $n > n_L$. Hence $T_n$ satisfies, for large $n$, (2.4) with
$\alpha_n = 2\rho$, $\beta_n = 0$, and $\gamma_n \ge 0$ satisfying furthermore
$\gamma_n(r_1, \ldots, r_n) > 2\gamma b_n$ for $|r_n| < L$. Since $\rho$ is arbitrary our
result follows from generalization 3. (The use of generalization 3 could have been avoided
as in the proof of result 1; for the sake of variety we illustrated both methods.)

Remark 2. Like (7.6) in the previous result, condition (7.12) here has no practical
importance and is necessary for the same reasons.

The conclusion that $x_n$ converges to $\theta$ with probability 1 was proved by Blum [1]
in the case of the Robbins-Monro procedure under exactly the assumptions made by us. He
also proved the same conclusion for the Kiefer-Wolfowitz procedure under the following
stronger assumptions: that $f(u)$ satisfies (7.12) with $A = 0$, and the condition
obtained from (7.13) on replacing $1/k < u - \theta < k$ and $1/k < \theta - u < k$ by
$1/k < u - \theta < \infty$ and $1/k < \theta - u < \infty$ under the sup and inf signs,
respectively (Blum, following Kiefer and Wolfowitz, formulates his assumptions somewhat
differently but they are easily seen to be equivalent to those stated here). Blum's
results contain those due to Robbins-Monro [7], Wolfowitz [8], Kiefer-Wolfowitz [6] and
Kallianpur [5]. Besides the stronger conclusion in both cases, our result 2 allows such
regression functions as $f(u) = -u^2$ and $f(u) = \exp(-u^2)$ which do not satisfy Blum's
conditions.
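
As an illustration of the recursion formed by (7.14) and (7.15), the following sketch runs
the Kiefer-Wolfowitz iteration $X_{n+1} = X_n + (b_n/c_n)(Z_{X_n+c_n} - Z_{X_n-c_n})$ on
the two regression functions just mentioned. The sequences $b_n = 1/n$ and
$c_n = n^{-1/3}$, the Gaussian noise, and the function name `kiefer_wolfowitz` are
assumptions made for the example only; condition (7.4) itself is not restated here.

```python
import math
import random


def kiefer_wolfowitz(f, x1, steps, noise_sd=0.0, seed=0):
    """Return X_{steps+1} of X_{n+1} = X_n + (b_n/c_n) * (Z(X_n+c_n) - Z(X_n-c_n)).

    Z(u) is f(u) plus independent Gaussian noise of standard deviation noise_sd.
    The choices b_n = 1/n and c_n = n**(-1/3) are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = x1
    for n in range(1, steps + 1):
        b_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        z_plus = f(x + c_n) + rng.gauss(0.0, noise_sd)
        z_minus = f(x - c_n) + rng.gauss(0.0, noise_sd)
        x += (b_n / c_n) * (z_plus - z_minus)
    return x


def bell(u):
    return math.exp(-u * u)      # maximum at theta = 0


def parabola(u):
    return -u * u                # maximum at theta = 0


print(kiefer_wolfowitz(bell, 1.0, 200))                    # noise-free run
print(kiefer_wolfowitz(parabola, 2.0, 200))                # noise-free run
print(kiefer_wolfowitz(bell, 1.0, 5000, noise_sd=0.1))     # noisy run
```

In each run the final iterate should lie close to the maximizing point $\theta = 0$.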

8.  A  special  case 

The method of proof of sections 2 and 3 can be adapted to give explicit bounds for
$E\{(X_n - \theta)^2\}$, etc. Here we shall do it only for a special case, furnishing an
extremely simple proof of the theorem in this case.

Assumption. The transformations $T_n$ of (2.5) satisfy

$$|T_n(r_1, \ldots, r_n) - \theta| \le F_n\,|r_n - \theta|, \tag{8.1}$$

$F_n$, $n = 1, 2, \ldots$, being a sequence of positive numbers satisfying

$$\prod_{n=1}^{\infty} F_n = 0. \tag{8.2}$$


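Putting $V_n^2 = E\{(X_n - \theta)^2\}$ and $\sigma_n^2 = E\{Y_n^2\}$ we have at once from
(2.5) and (8.1)

$$V_{n+1}^2 \le F_n^2 V_n^2 + \sigma_n^2. \tag{8.3}$$

On iteration we have

$$V_{n+1}^2 \le \sigma_n^2 + \sigma_{n-1}^2 F_n^2 + \cdots
+ \sigma_m^2 F_{m+1}^2 F_{m+2}^2 \cdots F_n^2 + \cdots
+ \sigma_1^2 F_2^2 \cdots F_n^2 + V_1^2 F_1^2 \cdots F_n^2. \tag{8.4}$$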

This is the estimate of $E\{(X_{n+1} - \theta)^2\}$. To prove (2.8) we merely have to
remark that by (8.4)

$$V_{n+1}^2 \le \sum_{j=m}^{n} \sigma_j^2 \cdot \max_{m \le k \le n} \prod_{i=k+1}^{n} F_i^2
+ \Big(V_1^2 + \sum_{j=1}^{m-1} \sigma_j^2\Big) \cdot \max_{0 \le k < m} \prod_{i=k+1}^{n} F_i^2. \tag{8.5}$$

Because of (8.2) all partial products $\prod F_i^2$ are uniformly bounded by a finite
number $A$, say. Given any $\varepsilon > 0$ choose $m$ so large that
$A \sum_{j=m}^{\infty} \sigma_j^2 < \varepsilon/2$; then (8.5) gives

$$V_{n+1}^2 \le \frac{\varepsilon}{2}
+ \Big(V_1^2 + \sum_{j=1}^{m-1} \sigma_j^2\Big) \cdot \max_{0 \le k < m} \prod_{i=k+1}^{n} F_i^2. \tag{8.6}$$

With $m$ being fixed, the max term in (8.5) tends to zero as $n \to \infty$ by (8.2) and
hence $V_{n+1}^2 \le \varepsilon$ for all sufficiently large $n$. This proves (2.8), and
(2.9) can be deduced thence by a simplified version of the argument at the end of section
3. [If it is assumed that all $F_n$ are $\le 1$ then the writing can be somewhat
abbreviated since the first max in (8.5) is simply 1 and the second occurs at
$k = m - 1$.]
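
The behavior of the bound can be seen by iterating (8.3) with the equality sign. The
sketch below is only an illustration: the function name `mean_square_bounds` and the
sample sequences $F_n = n/(n+1)$ (whose product tends to zero) and $\sigma_n^2 = 1/n^2$
(whose sum is finite) are assumptions chosen for the example.

```python
def mean_square_bounds(v1_sq, factors, noise_vars):
    """Iterate V_{n+1}^2 = F_n^2 * V_n^2 + sigma_n^2 and return [V_1^2, V_2^2, ...]."""
    bounds = [v1_sq]
    for f_n, s_n in zip(factors, noise_vars):
        bounds.append(f_n ** 2 * bounds[-1] + s_n)
    return bounds


N = 2000
factors = [n / (n + 1) for n in range(1, N + 1)]         # product over n tends to 0
noise_vars = [1.0 / n ** 2 for n in range(1, N + 1)]     # summable sigma_n^2
bounds = mean_square_bounds(1.0, factors, noise_vars)
print(bounds[1], bounds[10], bounds[100], bounds[-1])
```

The printed values decrease toward zero, in line with the conclusion drawn from (8.6).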

We  shall  illustrate  the  use  of  (8.3)  by  proving  the  following  minimax  result  on  the 
Robbins-Monro  procedure. 

Result 3. If the $Z_u$ satisfy (7.10) and if the regression function $f(u)$ is measurable
and satisfies

$$0 < A \le \frac{f(u)}{u - \theta} \le B < \infty \tag{8.7}$$


and if it is known that

$$|x_1 - \theta| \le C \le \Big[\frac{2\sigma^2}{A(B - A)}\Big]^{1/2}, \tag{8.8}$$


then if we use in the Robbins-Monro procedure the sequence

$$a_n = \frac{A C^2}{\sigma^2 + n A^2 C^2} \tag{8.9}$$

we shall have

$$E\{(X_n - \theta)^2\} \le \frac{\sigma^2 C^2}{\sigma^2 + (n - 1) A^2 C^2}. \tag{8.10}$$

For any other sequence $a_n$ there are $Z_u$ and $x_1$ satisfying all the above conditions
for which (8.10) does not hold.

By (7.8) the transformation $T_n(r_1, \ldots, r_n)$ in the Robbins-Monro scheme is
$r_n - a_n f(r_n)$. Hence (taking $\theta = 0$ throughout the proof), we have from (8.7)

$$|T_n(r_1, \ldots, r_n)| \le |r_n| \cdot \sup_u \Big|1 - a_n \frac{f(u)}{u}\Big|
\le |r_n| \cdot \max\,(1 - A a_n,\ B a_n - 1). \tag{8.11}$$

Thus if $a_n \to 0$, (8.1) holds for large $n$ with $F_n = 1 - A a_n$. Therefore if the
$a_n$ satisfy (7.2) the assumption is verified and hence $E\{X_n^2\}$ tends to zero. As
the sequence (8.9) clearly satisfies (7.2) the conclusion holds in this case. [So far no
use was made of (8.8). Also all the above follows directly from result 1.]

From (8.11) it follows that if

$$a_n \le \frac{2}{A + B} \tag{8.12}$$

then $T_n$ satisfies (8.1) with $F_n = 1 - A a_n$; hence we have in this case according to
(8.3)

$$V_{n+1}^2 \le (1 - A a_n)^2 V_n^2 + a_n^2 \sigma^2 \tag{8.13}$$

by (7.9) and (7.10). The minimum of the right side of (8.13) is achieved at

$$a_n = \frac{A V_n^2}{\sigma^2 + A^2 V_n^2}, \tag{8.14}$$

and for this choice of $a_n$ we have

$$V_{n+1}^2 \le \frac{\sigma^2 V_n^2}{\sigma^2 + A^2 V_n^2}. \tag{8.15}$$


Also, by (8.8), $V_1^2 \le C^2$; this and the recursion formula (8.15) give (8.10); and
substituting the right side of (8.10) for $V_n^2$ in (8.14) we obtain (8.9). Moreover, if
the $a_n$ thus computed satisfy (8.12), and if $x_1 = C$, and $f(u) = Au$ for all $u$, and
the equality sign always holds in (7.10), we have an equality sign also in (8.15). Thus
our last assertion will be proved if we show that the $a_n$ given by (8.9) satisfy (8.12),
but this is evident since they form a monotone sequence and, by (8.8),
$a_1 = AC^2/(\sigma^2 + A^2C^2) \le 2/(A + B)$.
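
The algebra connecting (8.9), (8.10) and (8.13) can be verified in exact arithmetic. The
sketch below is an illustration under stated assumptions: the helper names are
hypothetical, the parameter values are arbitrary, and (8.13) is iterated with the equality
sign starting from $V_1^2 = C^2$.

```python
from fractions import Fraction


def optimal_gain(n, A, sigma_sq, C_sq):
    """a_n as in (8.9): A*C^2 / (sigma^2 + n*A^2*C^2)."""
    return A * C_sq / (sigma_sq + n * A ** 2 * C_sq)


def bound(n, A, sigma_sq, C_sq):
    """Right side of (8.10): sigma^2*C^2 / (sigma^2 + (n-1)*A^2*C^2)."""
    return sigma_sq * C_sq / (sigma_sq + (n - 1) * A ** 2 * C_sq)


def iterate_813(steps, A, sigma_sq, C_sq):
    """Run (8.13) with equality from V_1^2 = C^2, using the gains (8.9)."""
    v_sq = C_sq
    values = [v_sq]
    for n in range(1, steps + 1):
        a_n = optimal_gain(n, A, sigma_sq, C_sq)
        v_sq = (1 - A * a_n) ** 2 * v_sq + a_n ** 2 * sigma_sq
        values.append(v_sq)
    return values


A, sigma_sq, C_sq = Fraction(3, 2), Fraction(2), Fraction(5, 7)   # arbitrary values
values = iterate_813(10, A, sigma_sq, C_sq)
expected = [bound(n, A, sigma_sq, C_sq) for n in range(1, 12)]
print(values == expected)
```

Because rational arithmetic is exact, the comparison confirms that the gains (8.9)
reproduce the right side of (8.10) at every step for these parameter values.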

It is also easy to dispose of the case when $C$ does not satisfy (8.8). We merely have to
start with $a_1 = 2/(A + B)$ and keep using this value until, for the first time, we have
$V_m$ not larger than the right side of (8.8). After that we define $a_{m+n-1}$ by the
right side of (8.9) with $V_m$ replacing $C$.
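
One possible reading of this two-phase rule is sketched below. It rests on assumptions
that should be checked against (8.8): the threshold is taken, on the squared scale, as
$2\sigma^2/(A(B - A))$, which is the largest $C^2$ for which (8.9) gives
$a_1 \le 2/(A + B)$; and during the first phase $V_n^2$ is tracked through the bound (8.3)
with $F_n$ from (8.11) and $\sigma_n^2 = a_n^2\sigma^2$. The function name
`gain_schedule` is hypothetical, and $B > A$ is required.

```python
from fractions import Fraction


def gain_schedule(A, B, sigma_sq, C_sq, count):
    """Return the gains a_1, ..., a_count for an initial bound V_1^2 <= C_sq.

    Assumes B > A. The threshold below is an assumed form of the right side
    of (8.8), expressed for V^2 rather than V.
    """
    threshold = 2 * sigma_sq / (A * (B - A))
    a_max = 2 / (A + B)
    contraction = max(1 - A * a_max, B * a_max - 1)   # F_n from (8.11)
    gains = []
    v_sq = C_sq
    # First phase: constant gain until the bound on V^2 drops to the threshold.
    while v_sq > threshold and len(gains) < count:
        gains.append(a_max)
        v_sq = contraction ** 2 * v_sq + a_max ** 2 * sigma_sq
    # Second phase: gains of the form (8.9) with V_m^2 in place of C^2.
    n = 1
    while len(gains) < count:
        gains.append(A * v_sq / (sigma_sq + n * A ** 2 * v_sq))
        n += 1
    return gains


one, three, four = Fraction(1), Fraction(3), Fraction(4)
print(gain_schedule(one, three, one, four, 4))
```

With $A = 1$, $B = 3$, $\sigma^2 = 1$ and $C^2 = 4$, the constant gain $1/2$ is used twice
before the bound falls below the assumed threshold, after which the gains decrease.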


9.  Concluding  remarks 

In all the preceding we have dealt with real random variables. Our methods carry over,
however, to more general situations. Since then it may be impossible to multiply or
square, one has either to operate with a pair of adjoint spaces or with the norm. We shall
show how the latter can be done in the case treated in the beginning of the last section.
Suppose $X_1$ and $Y_n$ assume values in a normed linear space $\mathcal{N}$ with $\|r\|$
denoting the norm of $r$. Let $\theta$ be an element of $\mathcal{N}$ and
$T_n(r_1, \ldots, r_n)$ be measurable transformations from the $n$th Cartesian power of
$\mathcal{N}$ into $\mathcal{N}$, and assume that
$\|T_n(r_1, \ldots, r_n) - \theta\| \le F_n \|r_n - \theta\|$ with $F_n > 0$ satisfying
(8.2). Then, if we put $X_{n+1} = T_n(X_1, \ldots, X_n) + Y_n$, the assumptions
$E\{\|X_1\|^2\} + \sum E\{\|Y_n\|^2\} < \infty$ and

$$E\{\|\varphi(X_1, \ldots, X_n) + Y_n\|^2\} \le E\{\|\varphi(X_1, \ldots, X_n)\|^2\} + E\{\|Y_n\|^2\} \tag{9.1}$$

for every measurable function $\varphi(x_1, \ldots, x_n)$ imply
$E\{\|X_n - \theta\|^2\} \to 0$ and $\|X_n - \theta\| \to 0$ with probability 1. Except
for substituting norm instead of absolute value not one word in the proof is changed. We
replaced the condition (2.7) by (9.1) since, in the case of real variables, (2.7) was used
solely to have $E\{\varphi_n Y_n\} = 0$ and thus obtain (9.1) with the sign of equality.
As, in general, we cannot multiply, we assumed (9.1) to start with. This condition is
related to orthogonality, and in many important cases may be deduced from relations
similar to (2.7). What has been said here about the special case treated in section 8 can
be suitably extended to cover the general case. So far as we know, the only treatment of
nonreal random variables in this connection is Blum's study [2] of the Robbins-Monro and
Kiefer-Wolfowitz schemes for random variables assuming values in finite dimensional
Euclidean space.

Chung [3] studied for special cases of the Robbins-Monro procedure the asymptotic
distribution of $X_n - \theta$. Under the general assumptions of our theorem nothing can,
of course, be asserted about asymptotic distributions. Such assertions would certainly
necessitate assumption of lower bounds on the effectiveness of the transformations $T_n$
and very much else. Our methods, however, do give bounds for the second moment of
$X_n - \theta$ and, assuming higher moments for $X_1$ and $Y_n$, can be used to obtain
bounds for the corresponding moments of $X_n - \theta$. In this connection the
inequalities of Doob (see chapter 8, section 3 in [4]) can be useful. (Doob's theorems may
also serve to give a modified proof of our main results.)

The general theory embraces naturally many other schemes besides those of Robbins-Monro
and Kiefer-Wolfowitz. It may also be modified to yield methods of obtaining confidence
intervals and the like.



Note added in proof. Since submitting the paper the author became aware of the following
two studies.

(a)  D.  L.  Burkholder,  “On  a  certain  class  of  stochastic  approximation  processes,” 
Mimeograph  Series  No.  129,  Institute  of  Statistics,  University  of  North  Carolina  (1955). 

(b)  C.  Derman,  “An  application  of  Chung’s  lemma  to  the  Kiefer-Wolfowitz  stochastic 
approximation  procedure,”  to  appear  in  Annals  of  Math.  Stat. 

The  most  relevant  results  of  these  papers  are  a  proof  of  the  probability  1  part  of 
Result  2  of  section  7  in  (a),  and  studies  of  the  asymptotic  distribution  in  special  cases 
of  the  Kiefer-Wolfowitz  procedure  in  both  (a)  and  (b). 